For instructions, see the page of the class

rm(list = ls())

1. Introduction

Context

We focus on nonparametric kernel regression and consider the Mean (Integrated) Squared Error as a measure of risk.

We consider the problem of estimating the following function:

f <- function(x) sqrt(x)*sin(pi*x)
curve(f, ylim = c(0, 1))

n <- 200
X <- 1:n/n

The number of observations is \(n = 200\) and we consider a fixed design: \(X_i = i/n, i=1 \dots n\). We generate a sample by adding Gaussian errors with variance 0.2 to the true function \(f\).

sigma <- 0.2
sigma2 <- sigma^2
eps <- rnorm(n, sd = sqrt(sigma2))
Y <- f(X) + eps
plot(X, Y)

Nadaraya-Watson estimator

The Nadaraya-Watson (NW) estimator of \(f\) at \(x \in [0,1]\) associated to the kernel \(K\) is:

\[\hat{f}_n(x) = \frac{\sum_{i=1}^n Y_i K\left(\frac{X_i - x}{h}\right)}{\sum_{i=1}^n K\left(\frac{X_i - x}{h}\right)}, \]

where \(h>0\) is called the bandwidth and \(K: \mathbb{R} \to \mathbb{R}\) is a kernel. We focus here on the Gaussian kernel.

MSE and MISE

The Mean Squared Error (MSE) at a point is defined by:

\[MSE(x) = \mathbb{E}_f \left[ \left(\hat{f_n}(x) - f(x)\right)^2 \right],\]

and the integrated risk, the Mean Integrated Squared Error (MISE), is defined by:

\[MISE = \mathbb{E}_f \left[ \int_0^1 \left(\hat{f_n}(x) - f(x)\right)^2 dx\right]\]

Just like during the course, we consider a discretized version of the MISE:

\[MISE^D = \mathbb{E}_f \left[\frac{1}{n} \sum_{i=1}^n \left(\hat{f_n}(X_i) - f(X_i)\right)^2 \right]\]

R functions for nonpametric regression

There are two main R functions for kernel nonparametric regression:

  • ksmooth implements the Nadaraya-Watson kernel estimate with rectangular or Gaussian kernel.
  • locpoly implements local polynomial estimators with Gaussian kernel.

We will only use the function locpoly. It is part of the package KernSmooth, so we load this package:

## KernSmooth 2.23 loaded
## Copyright M. P. Wand 1997-2009

We begin by looking at the help pages of these functions:

# ?locpoly

The three most important arguments for us are x, y, bandwidth, and degree.

We will work with the following custom function:

NW <- function(x, y, bandwidth, gridsize = length(x), ...) {
  locpoly(x, y, degree = 0, bandwidth = bandwidth, gridsize = gridsize, ...)
}

2. Nadaraya-Watson estimator

We calculate the Nadaraya-Watson estimator and plot it along with the data points:

fit <- NW(X, Y, bandwidth = 0.3)
plot(X, Y)
curve(f, add = TRUE, col = 2, lty = 2)
lines(fit, col=3)
lgd <- c("data points", "true f", "NW estimation")
legend("bottom", lgd, col = 1:3, lty = c(NA, 2, 1), pch = c(1, NA, NA))

Question 1: influence of the bandwidth

Let us admit for a while that a “good choice” for the bandwidth in the NW estimator is \(h^\star = 0.057\). In order to further asses the influence of the bandwidth on the quality of estimation, we compare this choice to two other choices \(h = 1/C \times h^\star\) and \(h = C \times h^\star\), for \(C=10\).

par(lwd = 2)
plot(X, Y, col = "lightgray")
curve(f, add = TRUE, col = 1, lty = 2)
best <- 0.057
C <- 10
bwd <- c(1/C, 1, C)*best
for (kk in 1:length(bwd)) {
    fit <- NW(X, Y, bandwidth = bwd[kk])
    lines(fit, col = 1 + kk)
}
lgd <- c("h = 1/C x h*", "h = h*", "h = C x h*")
legend("bottom", lgd, col = 2:4, lty = 1)

  1. Run the same code with smaller values for \(C\).
C_list <- c(0.1, 0.5, 1, 5)
par(mfrow = c(1,1))
for ( C in C_list )
{
  plot(X, Y, col = "lightgray", main = paste("C = ", C))
  curve(f, add = TRUE, col = 1, lty = 2)
  bwd <- c(1/C, 1, C)*best
  for (kk in 1:length(bwd)) {
    fit <- NW(X, Y, bandwidth = bwd[kk])
    lines(fit, col = 1 + kk)
  }
  lgd <- c("h = 1/C x h*", "h = h*", "h = C x h*")
  legend("topright", lgd, col = 2:4, lty = 1)
}

  1. Comment on the influence of the bandwidth on the quality of estimation.

For the smallest bandwidth values, the estimated function fluctuates a lot (ex. the blue curve for C = 0.1). This is because for a low bandwidth, the estimator makes calculations based on a very narrow interval and returns a function which sticks to local observations too much.

For the largest bandwidth values, the estimated function is almost constant (ex. the blue curve for C = 5). Indeed, with a very large bandwidth, the estimators uses intervals that are too wide, it doesn’t take into account local observations enough.

As a conclusion:

  • With a bandwidth that is too small, the model is overfitting the data

  • With a bandwidth that is too large, it is underfitting the data

The objective is to find the optimal bandwidth.

3. Estimation of MSE and MISE

The goal of this section is to estimate the optimal bandwidth for the NW estimator, and justify the choice \(h^\star = 0.057\) in the preceding section. We measure the quality of estimation by the discretized MISE:

\[MISE^D = \mathbb{E}_f \left[ \frac{1}{n} \sum_{i=1}^n \left(\hat{f_n}(X_i) - f(X_i)\right)^2 \right]\]

If \(f\) was known, then we could estimate the discretized MISE by the following quantity:

\[\frac{1}{n} \sum_{i=1}^n \left(\hat{f_n}(X_i) - f(X_i)\right)^2\]

Question 2: theoretical bandwidth

To find the optimal bandwidth, we simply minimize MISE_NW as a function of \(h\).

opt <- optimize(MISE_NW, X, Y, f, interval = c(0, 1))
best <- round(opt$minimum, 3)
best
## [1] 0.05
  1. Can the above be done in practice, ie given only the observations X and Y?

The above calculation can’t be done in practice as it requires the knowledge of the function f. In our case, f is the objective so we can’t use it in the estimator.

  1. Estimate the above optimal bandwidth from another simulation run with \(n=2,000\). Are these results consistent with the theoretical upper bounds obtained during the course? [Hint: recall the form of the optimal bandwidth for a kernel of order 1].

First, let’s generate new X and Y with 2000 observations

X_2000 <- 1:2000/2000
eps <- rnorm(n, sd = sqrt(sigma2))
Y_2000 <- f(X_2000) + eps

plot(X_2000, Y_2000)

And calculate the new optimal bandwidth.

opt_2000 <- optimize(MISE_NW, X_2000, Y_2000, f, interval = c(0, 1))
best_2000 <- round(opt_2000$minimum, 3)
paste ("Optimal bandwith with 2000 observations = ", best_2000)
## [1] "Optimal bandwith with 2000 observations =  0.022"

Then plot the estimated function f as well as the real f.

plot(X_2000, Y_2000, col = "lightgray", main = paste("f vs estimated f with optimal bandwidth = ", best_2000))
curve(f, add = TRUE, col = 1, lty = 2)
bwd <- best_2000

fit <- NW(X_2000, Y_2000, bandwidth = bwd)
lines(fit, col ="red")

The estimated curve is really close to the real function f. The estimator did a good job at estimating the function.

We know that the gaussian kernel has an order equal to 1. Based on the 2 measurments of the optimal bandwidth, we can estimate the beta coefficient ad check that its integer part is equal to 1. We know that \[h_n^* = C_{te} \times n^{\frac{-1}{2\beta+1}}\] Then as we’ve measured \(h^*\) for \(n=200\) and \(n=2000\) we can estimate \(\beta\) \[\beta = \frac{1}{2} \times (1 - \frac{\ln(\frac{n_1}{n_2})}{\ln(\frac{h_{n1}^*}{h_{n2}^*})})\] We first estimate a mean optimal bandwidth over various samples so that it isn’t too reliant on the noise. Then we calculate the beta coefficient.

measure_mean_opt_bwd <- function (n, nRep) {
  best_bwd_list <- c()
  for (i in 1:nRep) {
    X <- 1:n/n
    eps <- rnorm(n, sd = sqrt(sigma2))
    Y <- f(X) + eps
    opt <- optimize(MISE_NW, X, Y, f, interval = c(0, 1))
    best <- round(opt$minimum, 3)
    
    best_bwd_list[length(best_bwd_list) + 1] <- best
  }
  return (mean(best_bwd_list))
}
n_1 <- 200
n_2 <- 2000
h_n1 <- measure_mean_opt_bwd(n_1, 10)
h_n2 <- measure_mean_opt_bwd(n_2, 10)
beta <- 0.5 * (1 - log(n_1 / n_2) / log(h_n1 / h_n2) )
print(paste ("Estimated beta coefficient = ", round(beta,3)))
## [1] "Estimated beta coefficient =  2.642"

Even by doing a lot of simulations, we always get a beta whose entiere part is equal to 2. This is a surprising result as we would’ve expected beta = 1.

  1. Without doing further simulation runs, estimate the optimal bandwidth for \(n=20,000\) and \(n=2,000,000\).

With the \(\beta\) known, we can estimate the value of any sample size : \[h_{n2}^* = h_{n1}^* \times (\frac{n_2}{n_1})^{\frac{-1}{2\beta+1}}\]

n_3 <- 20000
h_n3 <- h_n1 * (n_3 / n_1) ^( -1 / (2*beta +1) )
print(paste("Estimated optimal bandwidth for n = ", n_3, "=", round (h_n3, 3)))
## [1] "Estimated optimal bandwidth for n =  20000 = 0.023"
n_4 <- 200000
h_n4 <- h_n1 * (n_4 / n_1) ^( -1 / (2*beta +1) )
print(paste("Estimated optimal bandwidth for n = ", n_4, "=", round (h_n4, 3)))
## [1] "Estimated optimal bandwidth for n =  2e+05 = 0.016"

Question 3: a first estimator of the MISE

We consider the following estimator of the discretized MISE:

\[\widehat{MISE}^{D,0} = \frac{1}{n} \sum_{i=1}^n \left(\hat{f_n}(X_i) - Y_i\right)^2\]

This estimator can be implemented as follows:

MISE_NW_0 <- function(bandwidth, x, y) {
    fit <- NW(x, y, bandwidth = bandwidth)
    mean((fit$y - y)^2)
}

We define a grid of 100 values for the bandwidth h between 0 and 0.2.

hs <- 1:500/500

1.Calculate the above estimator for all bandwidths, and plot the result as a function of h.

mise_nw_0_list <- c()
for (h in hs){
  mise_nw_0_list[length(mise_nw_0_list) + 1] <- MISE_NW_0(h,X,Y)
}

plot(hs, mise_nw_0_list, xlab = "bandwidth", ylab = "Discretized MISE")

2.Comment the results and explain why this estimator cannot be used to estimate an optimal bandwidth.

The estimator can’t be used to find the hs that gives the minimum MISE. The lowest MISE is obtained with h -> 0 and we’ve already explained that this is not an option as it leads to an overfitting model.

However, we find a first inflexion point before \(h = 0.1\)

hs <- 1:100/500
mise_nw_0_list <- c()
for (h in hs){
  mise_nw_0_list[length(mise_nw_0_list) + 1] <- MISE_NW_0(h,X,Y)
}
plot(hs, mise_nw_0_list, xlab = "bandwidth", ylab = "Discretized MISE")

This inflexion point is around \(h \sim 0.05\) and is between two intervals where the MISE increases a lot with h.

So even if we can’t find the h that gives the minimal MISE, we can graphically assume that the optimal bandwidth is around 0.05 which is the result we calculated above.

4. The bias/variance tradeoff

Question 4: an “oracle” estimator

The estimator at Question 2 is based on the knowledge of \(f\) and a single realization \((X,Y)\). To estimate the discretized MISE when the simulation model is known, we propose in this section to replace the expectation \(\mathbb{E}_f\) in the definition of the discretized MISE by an empirical mean over simulation runs.

We start by defining a function that performs one simulation run and outputs the associated NW estimator:

fHat <- function(bandwidth, n, sigma2, f) {
    eps <- rnorm(n, sd = sqrt(sigma2))
    X <- 1:n/n
    Y <- f(X) + eps
    fit <- NW(X, Y, bandwidth = bandwidth)
    fit$y
}   

We display the result of three simulation runs:

curve(f, ylim = c(-0.1, 0.8))
lines(X, fHat(0.1,  n, sigma2, f), col = 2, lty = 2)
lines(X, fHat(0.1,  n, sigma2, f), col = 3, lty = 2)
lines(X, fHat(0.1,  n, sigma2, f), col = 4, lty = 2)

Following this idea, we generate and plot a matrix of realizations of \(\hat{f}\):

par(mfrow = c(1,1))
bandwidth <- 0.3
fHatMat <- replicate(101, fHat(bandwidth, n, sigma2, f))
curve(f)
matlines(X, fHatMat, col = "lightgray")        
curve(f, add = TRUE)

  1. What represent the following distribution of green curves :
par(mfrow = c(1,1))
bandwidth <- 0.04
fHatMat <- replicate(101, fHat(bandwidth, n, sigma2, f))
curve(f)
matlines(X, fHatMat, col = 3)
curve(f, add = TRUE)

These are 100 estimations of f with a bandwidth = 0.04. This value of the bandwidth is close to the calculated optimum bandwidth, that’s why the estimations are fine.

  1. Likewise, comment on the following distribution of yellow curves :
par(mfrow = c(1,1))
bandwidth <- 0.004
fHatMat <- replicate(101, fHat(bandwidth, n, sigma2, f))
curve(f)
matlines(X, fHatMat, col = 7)
curve(f, add = TRUE)

These are 100 estimations of f with a bandwidth = 0.004. 0.004 is a very low bandwidth and the estimated functions fluctuate a lot. As explained previoulsy, the esimator is overfitting the data.

Question 5

  1. Generate similar plots for other choices of bandwidths. Do these plots confirm your previous observations of the influence of the bandwidth?
par(mfrow = c(1,1))
bandwidth <- 0.01
fHatMat <- replicate(101, fHat(bandwidth, n, sigma2, f))
curve(f)
matlines(X, fHatMat, col = 7)
curve(f, add = TRUE)

With a bandwidth = 0.01, we are overfitting as expected, the bandwidth is too small

par(mfrow = c(1,1))
bandwidth <- 0.5
fHatMat <- replicate(101, fHat(bandwidth, n, sigma2, f))
curve(f)
matlines(X, fHatMat, col = 7)
curve(f, add = TRUE)

With a bandwidth = 0.2, the model is underfitting as it is too large.

  1. We can form an estimate of the MSE as follows:
diffs <- fHatMat - f(X)            ## Note: this is a bit dangerous
diffs2 <- sweep(fHatMat, 1, f(X))  ## Safer but harder to understand
max(abs(diffs2-diffs))             ## sanity check
## [1] 0
str(rowMeans(diffs^2))
##  num [1:200] 0.162 0.161 0.159 0.157 0.155 ...

We encapsulate the above in a function:

MSE <- function(bandwidth, n, sigma2, f, nRep) {
    fHatMat <- replicate(nRep, fHat(bandwidth, n, sigma2, f))
    diffs <- sweep(fHatMat, 1, f(X))
    mse <- rowMeans(diffs^2)
}

We recall that we have the following bias-variance decomposition:

\[ MISE = \int_0^1 b^2(x) dx + \int_0^1 \sigma^2(x) dx,\]

where \(b(x) = \mathbb{E}_f \left[\hat{f_n}(x) \right]- f(x)\) and \(\sigma^2(x) = \mathbb{E}_f \left[(\hat{f_n}(x) - \mathbb{E}_f (\hat{f_n}(x)))^2 \right]\).

What do the following quantities MSHh1,2,3 and Estim1,2,3 contain ?

MSEh1 <- MSE(0.3, n, sigma2, f, 1001)
MSEh2 <- MSE(0.04, n, sigma2, f, 101)
MSEh3 <- MSE(0.003, n, sigma2, f, 101)
Estim1=sum(MSEh1)/n
Estim2=sum(MSEh2)/n
Estim3=sum(MSEh3)/n
Estim1
## [1] 0.0376379
Estim2
## [1] 0.001632265
Estim3
## [1] 0.02010372

MSEh1.2.3 are vector of size n containing the MSE for each observation over the nRep contained in fhatmat. Estim1,2,3 are scalars equal to the mean of the n MSE calculated over each observation. Estim is the MISE - MSEh1 is the vector of n MSE for a bandwidth \(h1 = 0.3\) and 1000 repetitions. This is a large bandwidth and the model is underfitting, that’s why the MSE is large. - MSEh2 is the vector of n MSE for a bandwidth \(h2 = 0.04\) and 100 repetitions. This is close to the optimal bandwidth and the MSE is low. - MSEh3 is the vector of n MSE for a bandwidth \(h3 = 0.003\) and 100 repetitions. This is a low bandwidth and the model is overfitting. That’s why the MSE is larger than with an optimal bandwidth.

Question 6

  1. Can this estimator of MSE be used in practice? No it can’t be used in practice, because it is once again using f to calculate b.

  2. Estimate \(b^2:=\int_0^1 b^2(x)dx\), \(\sigma^2:=\int_0^1 \sigma^2(x)dx\) and plot them as a function of \(h\) on the same figure along with \(MISE\). Comment on this figure in terms of bias-variance tradeoff.

calculate_bias_and_variance <- function (nRep, X_vect) {
  bias_list <- c()
  var_list <- c()
  
  n <- length(X_vect)
  
  for (h in hs){
    fHatMat <- replicate(nRep, fHat(h, n, sigma2, f))
    
    FHatMat_esp <- rowMeans(fHatMat)
    bias <- FHatMat_esp - f(X_vect)
    
    diffs <- sweep(fHatMat, 1, FHatMat_esp)
    var <- rowMeans(diffs^2)
    
    bias_list[length(bias_list) + 1] <- sum(bias**2)/n
    var_list[length(var_list) +1 ] <- sum(var)/n
  }
  
  return( list ("bias" = bias_list, "var" = var_list) )
}

nRep <- 101
hs <- 1:100/500
bias_and_var <- calculate_bias_and_variance(nRep, X)
bias_list <- bias_and_var$bias
var_list <- bias_and_var$var
mise_list <- bias_list + var_list

plot(x = hs, y = bias_list, col = 2, ylab = "", type = "l", lwd = 2)
points(x = hs, y = var_list, col = 3, type = "l", lwd = 2)
points(x=hs, y = mise_list, col = 4, type = "l", lwd = 2)
legend(x = "topleft", lty = 1 , legend = c("Bias", "Variance", "MISE"), col = 2:4)

h_opt_mise <- hs[which.min(mise_list)]
print(paste("The bandwidth that minimizes the mise as bias + variance is", h_opt_mise))
## [1] "The bandwidth that minimizes the mise as bias + variance is 0.052"
print(paste("REMINDER : The bandwidth that was estimated previously as optimum was", best))
## [1] "REMINDER : The bandwidth that was estimated previously as optimum was 0.05"

5. Boundary effects

We now focus at the pointwise risk (MSE). We can visualize the influence of the choice of \(h\) directly on MSE. Again, we compare the “optimal” bandwidth to two other choices \(h = 1/C \times h^\star\) and \(h = C \times h^\star\), for a given \(C>1\).

Execute then explain what represent the following plot. You may change C also.

par(lwd = 2)
C <- 3
plot(X, MSE(best/C, n, sigma2, f, 50), t = 'l', ylab = "MSE", col = 2, ylim = c(0, 0.1))
lines(X, MSE(best, n, sigma2, f, 50), col = 3)
lines(X, MSE(C*best, n, sigma2, f, 50), col = 4)
lgd <- c("h = 1/C x h*", "h = h*", "h = C x h*")
legend("top", lgd, col = 2:4, lty = 1)
abline(h=0, lty = 2)

This plots represents the MSE as a function of X: For each X, nRep evaluations of the function f are made, and a MSE is calculated from these nRep values. The plots show that the MSE is really low from 0.2 to 0.8 whatever h, but on the boundaries of the interval, it increases.

Let’s try it with other values for C:

par(lwd = 2, mfrow = c(1,1))
C_list <- c(1,2,4, 8)
for (C in C_list) {
  plot(X, MSE(best/C, n, sigma2, f, 50), t = 'l', ylab = "MSE", col = 2, ylim = c(0, 0.1), main = paste("C = ", C))
  lines(X, MSE(best, n, sigma2, f, 50), col = 3)
  lines(X, MSE(C*best, n, sigma2, f, 50), col = 4)
  lgd <- c("h = 1/C x h*", "h = h*", "h = C x h*")
  legend("top", lgd, col = 2:4, lty = 1)
  abline(h=0, lty = 2)
  }

The boundary effects appear whatever the value of C is. We can even notice that for the greatest C, the MSE also increases in the center of the interval. That is because a large C implicates very small (h/C) or large (h x C) bandwidth which are situations of overfitting or underfitting.

Question 7

  1. Is the estimated MSE for the optimal bandwidth uniformly low or do you see a systematic bias?
par(lwd = 2, mfrow = c(1,1))
plot(X, MSE(best, n, sigma2, f, 50), col = 1)
lgd <- c("h = h*")
legend("top", lgd, col = 1, lty = 1)
abline(h=0, lty = 2)

We notice the boundary effects commented in the previous question. There is a systematic bias near the boundaries of the interval [0,1]. Appart from these boundary effects, the MSE is systematically low.

  1. Recalling the shape of the true function \(f\) (see below), and recalling that we are studying the Nadaraya-Waston estimator, can you propose an explanation for what happens near the boundary of the interval [0,1]?
bandwidth <- best
fHatMat <- replicate(101, fHat(bandwidth, n, sigma2, f))
curve(f)
matlines(X, fHatMat, col = 7)
curve(f, add = TRUE)

The NW estimator can be seen as a weighted average of the \(Y_i\) where the weights are decided by a Gaussian kernel. The further the points are from \(x\), the lower the weight in the estimation.

For instance, with \(x = 1/2\), a large part of the interval is taken into account to calculate the estimator. Moreover, this is the central zone in which the function is maximum so a large part of the information is used.

But in the boundaries, a small proportion of the \(X_i\) is really used because half of the gaussian centred on 0 or 1 is off limits. Moreover, the function f is very close to 0 for \(x \sim 0\) or \(x \sim 1\). So the majority of the observations information is very low weighted and not taken into account for the calculation of the estimator.

That’s why the estimator isn’t great at the boundaries for this function.

  1. In particular, try to figure out what happens with the estimator when the bandwidth is rather large, and illustrate it.

When the bandwidth is rather large, then the variance of the gaussian kernel increases. That means that for a given \(x\), more weight will be given to the \(X_i \sim x\) that are far from \(x\). If the variance increases too much, then the weight given to points that are too far is too important and the model underfits.

We expect the boundary effect to be more important in that case as explained in the previous question.

bandwidth <- best * 2
fHatMat <- replicate(101, fHat(bandwidth, n, sigma2, f))
curve(f)
matlines(X, fHatMat, col = 7)
curve(f, add = TRUE)

par(lwd = 2, mfrow = c(1,1))
plot(X, MSE(best, n, sigma2, f, 50), col = 1)
lgd <- c("h = 2 * h*")
legend("top", lgd, col = 2:4, lty = 1)
abline(h=0, lty = 2)

As seen on the plot, the boundary effect increases even at \(h = 2*h_{opt}\).

6. Local polynomials

Question 8

  1. Using the above example of Nadaraya-Watson, create a function fHat_LP that generates a local polynomial regression estimator, and functions MSE_LP and MISE_LP that estimate the associated MSE and MISE, respectively.
my_NW_LP <- function(x, y, bandwidth, degree ,gridsize = length(x), ...) {
  locpoly(x, y, degree = degree, bandwidth = bandwidth, gridsize = gridsize,  ...)
}

fHat_LP <- function(bandwidth, n, sigma2, f, degree) {
    eps <- rnorm(n, sd = sqrt(sigma2))
    X <- 1:n/n
    Y <- f(X) + eps
    fit <- my_NW_LP(X, Y, bandwidth = bandwidth, degree)
    fit$y
} 

MISE_LP <- function(bandwidth, x, y, f, degree) {
    fit <- my_NW_LP(x, y, bandwidth = bandwidth, degree)
    mean((fit$y - f(fit$x))^2)
}

MSE_LP <- function(bandwidth, n, sigma2, f, nRep, degree) {
    fHatMat <- replicate(nRep, fHat_LP(bandwidth, n, sigma2, f, degree))
    diffs <- sweep(fHatMat, 1, f(X))
    mse <- rowMeans(diffs^2)
}
  1. Calculate the optimal bandwidth for local polynomial estimators of degree 0, 1, 2.
degree_list <- 0:2
optim_bandwidth_list <- c()

print(paste("REMINDER: the optimal bandwitdh estimated before was ", best))
## [1] "REMINDER: the optimal bandwitdh estimated before was  0.05"
for (degree in degree_list) {
  opt <- optimize(MISE_LP, X, Y, f, degree,interval = c(0, 1))
  best_bwd <- round(opt$minimum, 3)
  print( paste ("The optimal bandwidth for LP estimators of degree ", degree, " = ",best_bwd) )
  optim_bandwidth_list[ length(optim_bandwidth_list) + 1 ] <- best_bwd
}
## [1] "The optimal bandwidth for LP estimators of degree  0  =  0.05"
## [1] "The optimal bandwidth for LP estimators of degree  1  =  0.079"
## [1] "The optimal bandwidth for LP estimators of degree  2  =  0.231"
  1. Plot the MSE associated to the optimal bandwidth, and compare the MSE at the boundary for local polynomial estimators of degree 0, 1, 2.
par(mfrow = c(1,1))

plot(X, MSE(best, n, sigma2, f, 50), col = 1, type = "l")
abline(h=0, lty = 2)

for ( i in 1:length(degree_list) ){
  
  points(X, MSE_LP(optim_bandwidth_list[i], n, sigma2, f, 50, degree_list[i]), col = 1 + i,
       main = paste("Degree = ", degree_list[i]), ylab = "MSE LP", ylim = c(0,0.2), type = "l", lwd = 2)
  abline(h=0, lty = 2)
  
}

legend( x = "topleft", legend = c("Previous MSE", "LP deg 0", "LP deg 1", "LP deg 2"), col = 1:4, lty = 1)

Even though it still exists, the boundary effects decreases significantly with a LP polynom of degree 1 or 2.

6. Cross-validation

Recall that a linear nonparametric estimator is an estimator that may be written as

\[ \hat{f}_{n,h}(x) = \sum_{i=1}^{n} Y_i W_{ni}(x, h) \]

(It is linear in \(Y\).) A nice property of linear NP estimators is that the leave-one-out cross-validation risk may be explicitly written as

\[CV(h) = \frac{1}{n} \sum_{i=1}^n \left(\frac{Y_i - \hat{f}(X_i)}{1-W_{ni}(X_i, h)} \right)^2\]

Local polynomial estimators are instances of linear NP estimators.

In the case of local polynomials, the matrix of weights \(W=(W_{ni}(X_j))_{i,j}\) may be obtained using a simple R function

getW <- function(X, h) {
    n <- length(X)
    W <- matrix(NA, n, n)
    for (ii in 1:n){
        Yi <- rep(0, n)
        Yi[ii] <- 1
        fit <- locpoly(X, Yi, bandwidth = h, gridsize = n)
        W[ii, ] <- fit$y
    }
    W
}

Question 9

  1. explain why the above code does indeed calculate \(W\)

In this code, W is filled line by line. for the line ii, the weights are calculated thanks to a vector Y built such as \(Y_i[ii] = 1\) but \(Y_I[i] = 0\) if \(i \neq ii\). Then the line ii is filled with terms that depend on \(X_j - X_{ii}\), where j is the column number. Each term \(W_{i,j}\) is indeed the weight between \(X_i\) and \(X_j\)

  1. write a function to estimate the cross-validation risk for a local polynomial estimator
CV_risk <- function(X, Y, h) {
  
  n <- length(X)
  W <- getW(X, h)
  CV <- 0
  
  for (i in 1:n) {
      
    fHat_Xi <- sum(Y*W[,i])
    numerator <- Y[i] - fHat_Xi
    denominator <- 1 - W[i,i]
    CV <- CV + (numerator/denominator)**2
    
  }
  
  return (CV/n)
}

CV_risk(X, Y, best)
## [1] 0.03553049
  1. compare the cross-validation bandwidth to the optimal bandwidth in terms of associated MSE.
par(mfrow = c(1,1))
CV_risk_list <- c()
for ( h in hs ) {
  CV_risk_list [length(CV_risk_list) + 1] <- CV_risk(X, Y, h)
}
plot(hs, CV_risk_list)

CV_risk_optimal_bwd <- hs[which.min(CV_risk_list)]
associated_mse_CV <- sum(MSE(CV_risk_optimal_bwd, n, sigma2, f, 50))/n 
associated_mse_opt_bwd <- sum(MSE(best, n, sigma2, f, 50))/n 
print(paste ("The bandwidth that minimizes the CV risk is ", CV_risk_optimal_bwd, "with MSE = ", associated_mse_CV ) )
## [1] "The bandwidth that minimizes the CV risk is  0.06 with MSE =  0.00177618407913173"
print( paste("REMINDER : The bandwidth that minimizes the MSE is ", best, "with MSE = ", associated_mse_opt_bwd) )
## [1] "REMINDER : The bandwidth that minimizes the MSE is  0.05 with MSE =  0.0018254215669124"
  1. which of the two should be used in practice?

The bandwidth obtained thanks to the Cross Validation should be used as it was calculated with a method that does not require f to be known.

7. Derivative

The function locpoly can also calculate derivatives of the regression:

fit <- locpoly(X, Y, drv = 1, bandwidth = 0.1, gridsize=length(X))
plot(fit$x, fit$y)
ff <- function(x) sin(pi*x)/sqrt(x)/2 + pi*sqrt(x)*cos(pi*x)
curve(ff, add = TRUE)

Question 10

  1. What is pictured in the above estimation plot?

It is easy to demonstrate that ff is the analytic derivative of the function f. The curve is the curve of the derivative of the function f.

The argument drv = 1 means that the locpoly function will try to predict the first derivative of the function based on the observations X and Y.

The black points are the estimated derivative of the function.

  1. Evaluate the MSE and MISE performance of the local polynomial estimator of the derivative of f.

Let’s simulate a realization of the function ff for the same vector X

YY <- ff(X) + eps
plot(X, YY, main = "Realization of ff")
curve(ff, add = TRUE)

Then let’s evaluate the MSE and MISE for various degrees of local polynomials

degree_list <- 0:2

df <- data.frame( "degree" = degree_list, "MSE" = rep(0, length(degree_list)), "MISE" = rep(0, length(degree_list)))

for (degree in df$degree) {
  mse_der <- sum( MSE_LP(best, n, sigma2, ff, 50, degree)) / n
  mise_der <- MISE_LP(best, X, YY, ff, degree)
  
  df[df$degree == degree, c("MSE", "MISE")] <- c(mse_der, mise_der
                                                 )
}

df
##   degree         MSE        MISE
## 1      0 0.007753179 0.008933495
## 2      1 0.002774179 0.003062984
## 3      2 0.002156477 0.001528482

We can also plot the MSE alongside the X axis

par(mfrow = c(1,1))

plot(X, MSE(best, n, sigma2, ff, 50), col = 1, type = "l")
abline(h=0, lty = 2)

for ( i in 1:length(degree_list) ){
  
  points(X, MSE_LP(optim_bandwidth_list[i], n, sigma2, ff, 50, degree_list[i]), col = 1 + i,
       main = paste("Degree = ", degree_list[i]), ylab = "MSE LP", ylim = c(0,0.2), type = "l", lwd = 2)
  abline(h=0, lty = 2)
  
}

legend( x = "topleft", legend = c("Previous MSE", "LP deg 0", "LP deg 1", "LP deg 2"), col = 1:4, lty = 1)

Appendix

Why fixing the grid size in locpoly?

By default the locpoly function returns a vector of values for \(\hat{f}(x)\) for 401 equally-spaced values of \(x\) in \([0,1]\). Here, we want \(\hat{f}(X_i)\) for \(1 \leq i \leq n\), where the \(X_i\) are n equally-spaced values in \([0,1]\):

fit <- locpoly(X, Y, bandwidth = 1)
length(fit$y)
length(fit$y) == length(X)

Our custom function locpoly2 forces the option gridsize n locpoly to match the size of the input design:

fit <- locpoly2(X, Y, bandwidth = 1)
length(fit$y)
length(fit$y) == length(X)
max(abs(X - fit$x))
## [1] 1.110223e-16